Predicting the thermal efficiency of solar water heating systems under real-world operating conditions is a challenging task, primarily due to continuously changing weather patterns and the inherent thermal inertia of the system. Traditional analytical approaches, such as the Hottel–Whillier–Bliss (HWB) equations, rely on steady-state assumptions that are rarely satisfied in practical, large-scale installations. In this study, an autoregressive machine learning pipeline is developed to analyse operational data from a commercial solar thermal plant. The dataset consists of 4,674 SCADA records collected between 2016 and 2018, which were preprocessed to ensure data quality and consistency. Since direct irradiance measurements were unavailable due to missing pyranometer data, clear-sky Global Horizontal Irradiance (GHI) was estimated using the pvlib Ineichen model. Three regression models, Ordinary Least Squares (OLS), Random Forest, and XGBoost, were trained on a refined subset of 163 high-quality temporal sequences. To account for thermal lag, i.e., the delay between solar energy absorption and the system’s thermal response, additional temporal features were introduced, including a 60-minute rolling irradiance integral and a 10-minute lag of the target efficiency. These features improved the representation of system dynamics and simplified the underlying relationships. Among the models evaluated, the OLS model achieved the best performance (R² = 0.3664, RMSE = 0.1168), indicating that with appropriate feature engineering, simpler models can perform competitively against more complex machine learning techniques. The results also highlight the limitations of real-world SCADA data and provide insights into the trade-off between model complexity and interpretability in solar thermal system analysis.
Introduction
This study focuses on improving the performance prediction of solar thermal systems, particularly flat-plate collectors, under real-world conditions. Traditional analytical models assume steady-state conditions and often overestimate efficiency (by 15–25%) because they fail to account for environmental variability and thermal lag.
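For reference, the steady-state collector efficiency in the Hottel–Whillier–Bliss form is commonly written as shown below. The symbols follow standard textbook usage and are not defined in this paper: F_R is the heat removal factor, (τα) the transmittance-absorptance product, U_L the overall loss coefficient, T_i the fluid inlet temperature, T_a the ambient temperature, and G_T the irradiance on the collector plane.

```latex
\eta = \frac{Q_u}{A_c G_T} = F_R(\tau\alpha) - F_R U_L \, \frac{T_i - T_a}{G_T}
```

The expression contains no time-dependent storage term, so it cannot represent a delayed response to changing irradiance.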
To address this, the study adopts a machine learning (ML)-based approach using real-world SCADA data from a solar plant. Unlike prior research that relies on ideal or synthetic datasets, this work emphasizes handling noisy, incomplete, and unstructured real-world data, which includes challenges like missing values and sensor inconsistencies.
A key issue addressed is thermal lag (a delay of 10–20 minutes between solar input and system response), which reduces the effectiveness of models based only on instantaneous data. To overcome this, the study introduces autoregressive features, such as:
A rolling average of solar irradiance over time,
A lagged efficiency value from previous time steps.
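The two features listed above can be sketched with pandas as follows. This is a minimal illustration rather than the study's actual code: the column names (ghi_clear, efficiency), the helper name add_temporal_features, and the 10-minute sampling of the demo data are assumptions, while the 60-minute window and 10-minute lag follow the values reported in the abstract.

```python
import numpy as np
import pandas as pd


def add_temporal_features(df, ghi_col="ghi_clear", eff_col="efficiency",
                          window="60min", lag="10min"):
    """Return a copy of df with a rolling irradiance mean and a lagged efficiency.

    df must have a DatetimeIndex with unique timestamps.
    Column and function names are hypothetical.
    """
    out = df.sort_index().copy()
    # Trailing time-based window: mean irradiance over the last 60 minutes,
    # including the current sample. Early rows use a partially filled window.
    out["ghi_roll_60min"] = out[ghi_col].rolling(window).mean()
    # Efficiency observed 10 minutes earlier, aligned by timestamp so that
    # gaps in the log produce NaN instead of a wrong neighbour.
    out["eff_lag_10min"] = out[eff_col].shift(freq=lag).reindex(out.index)
    return out


# Small synthetic example (not plant data).
idx = pd.date_range("2017-06-01 10:00", periods=8, freq="10min")
demo = pd.DataFrame(
    {"ghi_clear": 100.0 * np.arange(8), "efficiency": 0.05 * np.arange(8)},
    index=idx,
)
features = add_temporal_features(demo)
```

Aligning the lag by timestamp rather than by row position is a design choice that suits SCADA logs with missing records, since a row-based shift would silently pair a sample with one from the wrong time.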
The methodology includes:
Data cleaning and preprocessing of SCADA logs,
Reconstruction of solar irradiance using modeling tools,
Feature engineering to capture temporal dependencies,
Training Linear Regression (OLS), Random Forest, and XGBoost models.
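The final training step might look like the sketch below, which uses scikit-learn on synthetic stand-in data. The chronological 80/20 split, the hyperparameters, and the helper name evaluate_models are assumptions, since the paper does not state them here. XGBoost is left out to keep dependencies minimal; an xgboost.XGBRegressor instance could be added to the models dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_models(X, y, train_fraction=0.8, seed=0):
    """Fit OLS and Random Forest on the earliest rows and score on the latest.

    Rows are assumed to be in time order, so the split is not shuffled.
    """
    split = int(len(X) * train_fraction)
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]
    models = {
        "OLS": LinearRegression(),
        "RandomForest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        scores[name] = {"r2": float(r2_score(y_test, pred)), "rmse": rmse}
    return scores


# Synthetic stand-in for the engineered feature matrix (not plant data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 0.3 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
scores = evaluate_models(X_demo, y_demo)
```

An unshuffled split keeps later observations out of the training set, which matters once lagged targets are used as inputs.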
Results show that:
Models using only instantaneous data perform poorly,
Including temporal (lag-based) features significantly improves predictions,
Linear Regression performed best among tested models, highlighting that simpler models can outperform complex ones when data is limited and noisy.
Overall, the study demonstrates that incorporating time-dependent features is essential for accurate real-world solar system modeling, and that practical ML solutions must account for data imperfections and system dynamics.
Conclusion
In this study, a machine learning pipeline was developed to predict the thermal efficiency of a commercial solar water heating plant using raw SCADA telemetry. The results show that even noisy, unfiltered SCADA data can be effectively cleaned and paired with clear-sky irradiance estimates from pvlib to create a reliable training dataset. A key finding was the strong influence of thermal lag, which makes it very difficult to predict a commercial system's efficiency at any given instant.
To address this issue, autoregressive features were introduced into the model, specifically a 60-minute rolling average of irradiance and a 10-minute lag of the efficiency itself. This strategy accounted for the system's thermal inertia, raising R² from a baseline of 0.08 to 0.3664. Notably, these time-based features reduced the non-linearity in the data to the point that a simple Linear Regression model outperformed the heavier ensemble algorithms, Random Forest and XGBoost.
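For readers interpreting these figures, the two reported metrics have the standard definitions below, where y_i are the observed efficiencies, ŷ_i the predictions, ȳ the mean of the observations, and n the number of evaluated samples. This notation is introduced here for illustration and is not taken from the paper.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```

Under these definitions, an R² of 0.3664 means the model explains roughly 37% of the variance in the evaluated efficiency values, and the RMSE is expressed in the same units as the efficiency itself.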
References
[1] H. C. Hottel and A. Whillier, "Evaluation of flat-plate solar collector performance," Trans. Conf. Use of Solar Energy, vol. 2, pp. 74–104, 1955.
[2] R. W. Bliss, "The derivations of several 'plate efficiency factors' useful in the design of flat-plate solar heat collectors," Solar Energy, vol. 3, no. 4, pp. 55–64, 1959.
[3] J. A. Duffie and W. A. Beckman, Solar Engineering of Thermal Processes, 4th ed. John Wiley & Sons, 2013.
[4] S. A. Kalogirou, "Applications of artificial neural networks in energy systems: A review," Energy Conversion and Management, vol. 40, no. 10, pp. 1073–1087, 1999.
[5] I. Farkas and P. Géczy-Víg, "Neural network modelling of flat-plate solar collectors," Computers and Electronics in Agriculture, vol. 40, pp. 87–102, 2003.
[6] H. K. Ghritlahre and R. K. Prasad, "Prediction of thermal performance of unidirectional flow porous bed solar air heater with optimal training function using Artificial Neural Network," Energy Procedia, vol. 109, pp. 369–376, 2017.
[7] C. Voyant et al., "Machine learning methods for solar radiation forecasting: A review," Renewable Energy, vol. 105, pp. 569–582, 2017.
[8] R. Srivastava, A. N. Tiwari, and V. K. Giri, "Solar radiation forecasting using MARS, CART, M5, and Random Forest model: A case study for India," Heliyon, vol. 5, no. 10, e02692, 2019.
[9] C. Fan, F. Xiao, and Y. Zhao, "A short-term building cooling load prediction method using deep learning algorithms," Applied Energy, vol. 195, pp. 222–233, 2017.
[10] X. Qing and Y. Niu, "Hourly day-ahead solar irradiance prediction using weather forecasts by LSTM," Energy, vol. 148, pp. 461–468, 2018.
[11] P. Kumari and D. Toshniwal, "Deep learning models for solar irradiance forecasting: A comprehensive review," Journal of Cleaner Production, vol. 318, 128566, 2021.
[12] K. Leahy et al., "Issues with Data Quality for Wind Turbine Condition Monitoring and Reliability Analyses," Energies, vol. 12, no. 2, 201, 2019.
[13] J. Zhang et al., "A suite of metrics for assessing the performance of solar power forecasting," Solar Energy, vol. 111, pp. 157–175, 2015.
[14] R. Kumar and M. A. Rosen, "A critical review of photovoltaic-thermal solar collectors for air heating," Applied Energy, vol. 88, no. 11, pp. 3603–3614, 2011.
[15] A. Shukla, D. Buddhi, and R. L. Sawhney, "Solar water heaters with phase change material thermal energy storage: A review," Renewable and Sustainable Energy Reviews, vol. 13, no. 8, pp. 2119–2125, 2009.
[16] stritti, "Realtime Thermal Solar Plant Dataset," GitHub, 2018. [Online]. Available: github.com/stritti/thermal-solar-plant-dataset
[17] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[18] W. F. Holmgren, C. W. Hansen, and M. A. Mikofski, "pvlib python: A python package for simulating solar energy systems," Journal of Open Source Software, vol. 3, no. 29, p. 884, 2018.
[19] P. Ineichen and R. Perez, "A new airmass independent formulation for the Linke turbidity coefficient," Solar Energy, vol. 73, no. 3, pp. 151–157, 2002.
[20] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[21] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," Proc. 22nd ACM SIGKDD, pp. 785–794, 2016.